Frontiers in Digital Health
Frontiers Media SA
Preprints posted in the last 7 days, ranked by how well they match Frontiers in Digital Health's content profile, based on 20 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Basharat, A.; Hamza, O.; Rana, P.; Odonkor, C. A.; Chow, R.
Introduction Large language models are increasingly being used in healthcare. In interventional pain medicine, clinical reasoning is essential for procedural planning. Prior studies show that simplified prompts reduce clinical detail in AI-generated responses. It remains unclear whether this reflects knowledge loss or simply prompt-driven suppression of information. Methods We performed a controlled comparative study using 15 standardized low back pain questions representing common interventional pain questions. Each question was submitted to ChatGPT under three conditions: a professional-level prompt (DP), a fourth-grade reading-level prompt (D4), and clinician-directed rewriting of the D4 response to a medical level (U4→MD). No follow-up prompting was allowed. Three physicians independently rated responses for accuracy using a 0-2 ordinal scale. Clinical completeness was determined by consensus. Word count and Flesch-Kincaid Grade Level (FKGL) were also measured. Paired t-tests compared conditions. Results Accuracy was highest with professional prompting (1.76). Accuracy declined with the fourth-grade prompt (1.33; p = 0.00086). When simplified responses were rewritten for clinicians, accuracy returned to baseline (1.76; p ≈ 1.00 vs DP). Clinical completeness followed the same pattern: DP 80.0%, D4 6.7%, U4→MD 73.3%. Fourth-grade responses were shorter and less complex. Upscaled responses were more complex and similar in length to professional responses. Inter-rater reliability was low (Fleiss κ = 0.17), but trends were consistent across conditions. Conclusions Reduced clinical detail under simplified prompts appears to reflect constrained output rather than loss of knowledge. Clinician-directed reframing restores omitted content. LLM performance in interventional pain depends strongly on prompt design and intended audience.
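For readers unfamiliar with the inter-rater statistic quoted in this abstract, the following is a minimal sketch of how Fleiss' kappa is computed from a table of rating counts. The function name and the toy tables are illustrative assumptions; they are not the study's ratings and do not reproduce its value of 0.17.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of ratings in each category.
    n_categories = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)

    # Undefined (division by zero) if every rating falls in one category.
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical data: 3 items, 3 raters, categories 0/1/2 on an ordinal scale.
perfect_agreement = [[3, 0, 0], [0, 3, 0], [0, 0, 3]]
chance_level = [[2, 1, 0], [0, 2, 1], [1, 0, 2]]
print(fleiss_kappa(perfect_agreement), fleiss_kappa(chance_level))
```

Values near 0 indicate agreement no better than chance, which is why a kappa of 0.17 is described as low even when the condition-level trends agree.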
Vollam, S.; Roman, C.; King, E.; Tarassenko, L.
A Wearable Monitoring System (WMS), comprising a chest patch, wrist-worn pulse oximeter, and arm-worn blood pressure device, was developed in preparation for a pilot Randomised Controlled Trial (RCT) on a UK surgical ward. The system was designed to support continuous physiological monitoring and early detection of deterioration. An initial prototype user interface was developed by the research team based on prior clinical experience and engineering knowledge. To ensure suitability for clinical practice, iterative user-centred refinement was undertaken through a series of clinician focus groups and wearability assessments. Six focus groups were conducted between November 2019 and May 2021 involving multidisciplinary healthcare professionals. Feedback from these sessions informed successive interface and system modifications. System development spanned the COVID-19 pandemic, during which the WMS was rapidly adapted and deployed to support clinical care on isolation wards. Feedback obtained during this period was incorporated into later versions of the system and provided a unique opportunity to examine changes in clinician priorities under pandemic conditions. Clinicians consistently prioritised alert visibility, alarm fatigue mitigation, parameter flexibility, and centralised monitoring. Notably, preferences regarding alert modality and access mechanisms evolved over time: early enthusiasm for mobile or smartphone-type devices shifted towards a preference for fixed, ward-based displays and audible alerts at the nurses' station following pandemic deployment. Building on previous wearability testing in healthy volunteers, wearability testing using a validated questionnaire was completed by 169 patient participants during the RCT. The chest patch and pulse oximeter demonstrated high tolerability, whereas the blood pressure cuff showed poor wearability and was removed from the final system.
These findings demonstrate the importance of iterative, clinician-led design for wearable WMS and highlight how extreme clinical contexts such as the COVID-19 pandemic can significantly reshape perceived requirements for safety-critical monitoring technologies.
Kim, S.; Guo, Y.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.
Social determinants of health (SDoH) are important for clinical care, but it remains unclear how much AI-captured social context is preserved after clinician editing in ambient documentation workflows. We retrospectively analyzed 75,133 paired ambient AI-drafted and clinician-finalized note sections from ambulatory care at a large academic health system. Using a rule-based NLP pipeline, we extracted 21 SDoH categories and quantified retention, deletion, and addition. SDoH appeared in 25.2% of AI drafts versus 17.2% of final notes. At the mention level, AI captured 29,991 SDoH mentions, of which 45.1% were deleted and 54.9% were retained; clinicians added 3,583 new mentions. Insurance and marital status were most often deleted, whereas substance use and physical activity were more often retained. Deletion patterns also varied by specialty, supporting the need for specialty-aware ambient AI systems.
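The mention-level figures in this abstract can be turned into approximate counts with simple arithmetic. The sketch below is illustrative only: the function and variable names are assumptions, and because the percentages are rounded in the abstract, the derived counts are approximate rather than the authors' exact numbers.

```python
def mention_flow(ai_mentions: int, deleted_share: float, added: int) -> dict:
    """Approximate mention counts implied by a deletion share and additions."""
    deleted = round(ai_mentions * deleted_share)
    retained = ai_mentions - deleted      # whatever was not deleted was retained
    final_total = retained + added        # retained AI mentions plus clinician additions
    return {"deleted": deleted, "retained": retained, "final_total": final_total}


# Figures taken from the abstract: 29,991 AI mentions, 45.1% deleted, 3,583 added.
flow = mention_flow(ai_mentions=29_991, deleted_share=0.451, added=3_583)
print(flow)
```

Under these assumptions, roughly 13,500 AI-captured mentions were removed and the final notes held about 20,000 mentions once clinician additions are counted.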
Blankson, P.-K.; Hussien, S.; Idris, F.; Trevillion, G.; Aslam, A.; Afani, A.; Dunlap, P.; Chepkorir, J.; Melgarejo, P.; Idris, M.
Background: Recruitment remains a major barrier to timely clinical trial completion. Trialshub is an LLM-powered, chat-based platform intended to help users identify relevant trials and connect with coordinators to streamline recruitment workflows. Objective: To evaluate the perceived usability and operational value of Trialshub, and identify implementation considerations for real-world deployment. Methods: A usability test was conducted at Morehouse School of Medicine for the Trialshub application. Purposively selected participants included clinical research coordinators and individuals with and without clinical trial search experience. Participants completed a pre-test survey assessing demographics, digital health information behaviors, and familiarity with AI tools, followed by a moderated usability session using a Trialshub prototype. Users completed scenario-based tasks (locating a breast cancer trial, reviewing results, and initiating coordinator contact) using a think-aloud protocol. Task ratings, screen recordings, and transcribed feedback were analyzed descriptively and thematically, and reported. Results: Participants reported high comfort with using digital tools and moderate-to-high familiarity with AI. Trialshub's chat-first design, guided prompts, and checklist-style eligibility display were perceived as intuitive and reduced cognitive load. Fast access to trials and the coordinator-contact workflow were viewed positively. Key usability issues included uncertainty at step transitions, insufficient cues for selecting results and next actions, and inconsistent system reliability (loading delays, errors, and broken trial detail pages). Participants also noted redundant questioning due to limited conversational memory, requested improved filtering/sorting, and clearer calls-to-action. All participants indicated that Trialshub has strong potential to meaningfully improve clinical trial processes.
Conclusions: Trialshub shows promise for improving trial discovery and recruitment workflows, with identified design implications for real-world deployment.
Tian, J.; Kurkova, V.; Wu, Y.; Adu, M.; Hayward, J.; Greenshaw, A. J.; Cao, B.
Patient-generated streaming data from wearable and digital technologies is increasingly promoted as a means of supporting mental health monitoring and clinical decision-making. While patient acceptance of these technologies has been reported, clinician perspectives remain underexplored despite their central role in determining whether streaming data are meaningfully integrated into routine care. This study explored clinicians' experiences, as well as perceived facilitators and barriers, related to integrating patient-generated streaming data into routine mental health practice. A qualitative, exploratory interview study was conducted to examine clinicians' experiences and perspectives on integrating patient-generated streaming data into mental health care. Semi-structured interviews were conducted with 33 clinicians, including family physicians (n=11), psychiatrists (n=12), and psychologists (n=10). Data were analyzed using reflexive thematic analysis guided by Braun and Clarke's six-step approach. Six themes were identified. Clinicians described variable use of digital and streaming technologies, ranging from routine engagement to deliberate non-use. Streaming data were viewed as clinically valuable when they provided longitudinal and objective insights, identified physiological and behavioural pattern changes, and supported patient engagement. However, clinicians emphasized that clinical usefulness was contingent on interpretability, contextual information, and relevance to decision-making. Major barriers included poor integration with electronic medical records, time constraints, data volume, limited organizational support, and uncertainty regarding data reliability and validity. Clinicians also expressed persistent concerns about privacy, governance, and regulatory oversight, highlighting the need for clear safeguards and accountability structures.
Clinicians view patient-generated streaming data as a promising adjunct to mental health care, particularly for capturing longitudinal change between visits. However, meaningful clinical integration remains constrained by usability, workflow, organizational, and regulatory challenges, as well as limited confidence in data interpretation. Addressing these barriers through improved system integration, interpretive support, validation, and governance will be essential for translating the potential of streaming data into routine clinical practice.
Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.
Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes, and inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). By contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries, and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures and generation failures where retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows to avoid failures.
Yamga, E.; Goudrar, R.; Despres, P.
Introduction Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping, the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines.
Conclusion MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.
Mwaka, E. S.; Nabukenya, S.; Kasiita, V.; Bagenda, G.; Rutebemberwa, E.; Ali, J.; Gibson, D.
Background: Mobile phone-based tools are increasingly used to collect data on non-communicable disease (NCD) risk factors, particularly in low-resource settings where traditional data collection systems face operational and infrastructural constraints. This study examined stakeholder perspectives on the use of enhanced mobile phone-based capabilities to support the collection of public health surveillance data on NCD risk factors in low-resource settings. Methods: An exploratory qualitative study was conducted between November 2022 and July 2023. Twenty in-depth interviews were conducted with public health specialists, ethicists, NCD researchers, health informaticians, and policy makers in Uganda. Thematic analysis was used to interpret the results. Results: Four themes emerged from the data, including benefits of using mobile phone capabilities for NCD risk factor data collection; ethical, legal, and social implications; perceived challenges of using such mobile phone capabilities; and proposed solutions to improve the utility of phone-based capabilities in data collection on NCD risk factors. Participants recognized the potential of mobile technologies to improve data collection efficiency and expand access to hard-to-reach populations. However, concerns emerged regarding inadequate informed consent, risks to privacy and confidentiality, unclear data ownership, and vulnerabilities created by inconsistent enforcement of data protection laws. Social concerns included low digital literacy, unequal access to mobile devices, and fear of stigmatization. Participants emphasized the need for transparent communication, robust data governance, and community engagement. Conclusion: Mobile phone-based systems can strengthen the collection of NCD risk factor data in low-resource settings; however, their benefits depend on addressing key ethical, legal, and social challenges. 
To ensure responsible deployment, digital health initiatives must prioritize participant autonomy, data protection, equity, and trust building. Integrating contextualized ethical, legal, and social considerations into design and policy frameworks will be essential to leveraging mobile technologies in ways that support inclusive and effective NCD prevention and control.
Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.
The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.
Van Oyen, C.; Mirza-Haq, N.
MedSafe-Dx (v0) introduces a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Eleven models were evaluated, revealing a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
Martin, C. M.; Henderson, I.; Campbell, D.; Stockman, K.
Background: The instability-plasticity framework proposes that multimorbidity trajectories periodically enter instability phases that are vulnerable to escalation but also potentially modifiable through relational intervention. Whether such phases commonly resolve without acute care, or predominantly progress to hospitalisation, has not been quantified at scale. Objective: To quantify instability window outcomes across a longitudinal monitoring cohort; to test whether the characteristics distinguishing admitted from resolved windows reflect within-patient trajectory dynamics or between-patient severity; and to characterise which patient-reported and operator-rated signals reliably precede admission, using both a curated pilot sub-cohort and the full monitoring cohort with an explicit cross-cohort comparison. Methods: Two complementary analyses were conducted on data from the MonashWatch Patient Journey Record (PaJR) relational telehealth system. Instability windows were identified algorithmically (≥2 consecutive calls with Total_Alerts ≥3) across the full longitudinal dataset (16,383 calls, 244 patients, 2.5 years) and classified by linkage to ED and hospital admission data. Window characteristics were compared at window, patient, and paired within-patient levels. Pre-admission signal cascades were analysed in two configurations: a curated pilot sub-cohort (64 patients, 280 calls, ±10-day window, 103 admissions, December 2016-September 2017) and the full monitoring cohort (175 patients, 1,180 pre-admission calls, ±14-day window, December 2016-July 2019). A three-way cross-cohort comparison decomposed differences between the two configurations into pipeline and population effects. Results: 621 instability windows were identified across 157 patients (64% of the monitored cohort). 67.3% resolved without hospital admission or ED attendance, a rate stable across alert thresholds 1-5.
In paired within-patient analysis (n = 70), duration in days (p = 0.002) and multi-domain breadth (p < 0.001) distinguished admitted from resolved windows; alert intensity did not. In the pilot sub-cohort, patient-reported illness prognosis (Q21) was the dominant pre-admission signal (GEE beta = +0.058, AUC = 0.647, p-BH = 0.018). This finding did not replicate in the full cohort: Q21 was non-significant (GEE beta = -0.008, p = 0.154, AUC = 0.507). Cross-cohort analysis identified selective curation of the pilot sub-cohort as the primary explanation. In the full cohort, six signals escalated significantly before admission after Benjamini-Hochberg correction: total alerts, health impairment (Q26), red alerts, self-rated health (Q3), patient concerns (Q1), and operator concern (Q34). Health impairment achieved the highest individual AUC (0.605) and showed the longest pre-admission lead. No individual signal exceeded AUC 0.61. Conclusions: Two thirds of instability phases resolve without hospitalisation, providing direct empirical support for trajectory plasticity as a clinically frequent phenomenon. Within the same patient, persistence - in duration and in the consistency of high-severity multi-domain flagging across calls - distinguishes trajectories that tip into admission from those that resolve. The Q21 signal reversal between cohorts illustrates how selective curation can produce compelling but non-replicable findings in monitoring research. In the full population, objective alert signals and operator judgement, rather than patient illness prognosis, carry the pre-admission signal
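The window rule stated in the Methods of this abstract (at least 2 consecutive calls with Total_Alerts of at least 3) lends itself to a short illustration. The sketch below is one plausible reading of that rule, not the authors' pipeline: the function name, the list-of-integers input, and the example call sequence are all hypothetical, and the real system presumably also handles call dates and linkage to admissions.

```python
from typing import List, Tuple


def find_instability_windows(
    total_alerts: List[int], min_alerts: int = 3, min_calls: int = 2
) -> List[Tuple[int, int]]:
    """Return (first_call_index, last_call_index) for each run of at least
    `min_calls` consecutive calls whose alert count is >= `min_alerts`."""
    windows = []
    start = None  # index where the current high-alert run began, if any
    for i, alerts in enumerate(total_alerts):
        if alerts >= min_alerts:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_calls:
                windows.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the call history.
    if start is not None and len(total_alerts) - start >= min_calls:
        windows.append((start, len(total_alerts) - 1))
    return windows


# Hypothetical alert counts for one patient's successive calls.
calls = [0, 3, 4, 1, 5, 2, 3, 3, 3]
print(find_instability_windows(calls))
```

Varying `min_alerts` from 1 to 5 mirrors the threshold sensitivity check mentioned in the Results, where the resolution rate is reported as stable.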
Kwon, C.-Y.; Lee, B.; Kim, M.; Mun, J.-h.; Seo, M.-G.; Yoon, D.
Background: Hwa-byung (HB) is a Korean culture-bound syndrome characterised by prolonged suppression of anger and somatic complaints. No evidence-based digital therapeutic (DTx) has been developed for HB. We evaluated the feasibility, user experience (UX), and preliminary clinical effect of an acceptance and commitment therapy (ACT)-based DTx application, Hwa-free, for HB. Methods: Adults aged 19-80 years diagnosed with HB were enrolled in a four-week app-based intervention with assessment at baseline (Week 0), Week 2, Week 4, and Week 8 follow-up. The primary outcome was UX assessed via a 22-item survey at Week 4. Secondary outcomes included HB-related symptom and personality scales, depression, anxiety, anger expression, psychological flexibility, health-related quality of life, and heart rate variability. Results: Of 45 screened, 30 were enrolled and 28 constituted the modified intention-to-treat population. Mean app use was 19.9 ± 7.9 days (71.2% adherence over 28 days). Adverse events were infrequent and unrelated to the intervention. Positive response rates exceeded 80% for video content (items 2-4: 82.8-89.7%), HB self-assessment (86.2%), meditation therapy (86.2%), and in-app guidance (85.7%). Pre-post improvements from baseline to Week 4 were observed in 11 of 18 clinical scales, including HB Symptom Scale (Δ = -9.8, Cohen's d = -0.92), Beck Depression Inventory-II (Δ = -13.3, d = -1.11), and state anger (Δ = -7.8, d = -0.96). The HB screening-positive rate declined from 100% at baseline to 55.6% at Week 8. Conclusions: Hwa-free demonstrated adequate feasibility, acceptable UX, and preliminary evidence of clinically meaningful improvement in HB-related symptoms. A future randomised controlled trial is warranted. Trial registration: CRIS, KCT0011105
Nkosi-Mjadu, B. E.
Background: South Africa's public healthcare system serves most of the population through approximately 3,900 primary healthcare clinics characterised by long waiting times and high volumes of repeat-prescription visits. No published pre-arrival digital triage system operates across all 11 official South African languages while aligning with the South African Triage Scale (SATS). This paper reports the design and preliminary safety validation of BIZUSIZO, a hybrid deterministic-AI WhatsApp triage system. Methods: BIZUSIZO delivers SATS-aligned triage via WhatsApp, combining AI-assisted free-text classification (Claude Haiku 4.5) with a Deterministic Clinical Safety Layer (DCSL) that overrides AI output for 53 clinical discriminator categories (14 RED, 19 ORANGE, 20 YELLOW) coded in all 11 official languages and independent of AI availability. A five-domain risk factor assessment can only upgrade triage level. One hundred and twenty clinical vignettes in patient language (English, isiZulu, isiXhosa, Afrikaans; 30 per language) were scored against a developer-assigned gold standard with independent blinded nurse review. A 121-vignette multilingual DCSL safety consistency check across all 11 languages and a 220-call post-hoc framing sensitivity evaluation (110 paired vignettes) were also conducted. Results: Under-triage was 3.3% (4/120; 95% CI: 0.9%-8.3%) with no RED under-triage; exact concordance was 80.0% (96/120) and quadratic weighted kappa 0.891 (95% CI: 0.827-0.932). One two-level under-triage was observed on a non-RED presentation (V072, isiXhosa burns vignette, ORANGE→GREEN); one two-level over-triage was observed (V054, isiZulu deep laceration, YELLOW→RED). In the framing sensitivity evaluation, AI-only classification achieved 50.9% RED invariance under adversarial framing; full-pipeline classification achieved 95.0% in four validated languages, with the DCSL rescuing 18 of 23 AI drift cases.
Conclusions: A hybrid deterministic-AI triage system with DCSL-based emergency detection achieved zero RED under-triage and consistent RED detection across all 11 official languages. The 16.7% over-triage rate falls within published South African SATS ranges (13.1-49%). A single two-level under-triage event was observed on an isiXhosa burns vignette (ORANGE→GREEN) and is discussed in Limitations. Findings are preliminary; prospective validation against independent nurse triage is the necessary next step.
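The under-triage interval quoted in this abstract (4/120; 95% CI 0.9%-8.3%) is the kind of figure an exact binomial interval produces for a small count. The sketch below computes a Clopper-Pearson interval using only the standard library. That the authors used this particular method is an assumption, and the function names are illustrative.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f: Callable[[float], float], target: float) -> float:
    """Bisection for f(p) = target on [0, 1], where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a proportion x/n."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2
    )
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2
    )
    return lower, upper


lower, upper = clopper_pearson(4, 120)
print(f"{4 / 120:.1%} (95% CI {lower:.1%} to {upper:.1%})")
```

With only four events, the interval is wide and asymmetric around the point estimate, which is consistent with the abstract's description of the findings as preliminary.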
Gausden, J.; Dujmovic, M.; Dunham, J. P.; Thakkar, B.; Bennet, T.; Burgess, C.; Young, A.; Whittaker, R. G.; Robinson, T.; Colvin, L.; O'Neill, A.; Pickering, A. E.
Neuropathy caused by chemotherapy is a common and debilitating side-effect of cancer treatment. With 30% of patients experiencing chronic neuropathy and no good evidence-based treatments, early detection triggering chemotherapy regime modification remains the best option for prevention. Early detection is challenging because of a lack of diagnostic tools with sufficient longitudinal temporal precision and convenience for patient/clinical adoption. To tackle this problem, we developed SenseCheQ, which enables self-administered autonomous sensory testing that patients can use at home. We present the instrumental engineering approach taken to address the challenge, including haptic self-calibration combined with skin thermal-clamping protocols, and demonstrate robustly reliable performance in the face of environmental and user-related variance in home settings. We present prospective case studies of people having chemotherapy treatment for cancer, conducting regular unsupervised quantitative sensory testing to monitor their nerve function at home. These proof-of-principle studies show SenseCheQ can detect sub-clinical changes in nerve function, matching patient reported outcomes and lab-based sensory testing. This highlights SenseCheQ's promise as a scalable biomarker platform for neuropathy-detection and therapeutic development.
Ferguson, D. J.
BackgroundClinical pharmacists, trainees, and educators rely on multi-database literature retrieval and structured evidence synthesis to answer drug-information questions. Existing workflows require navigation across PubMed, DailyMed, LactMed, interaction checkers, and specialty guideline repositories with manual de-duplication, appraisal, and synthesis. Commercial platforms that integrate these functions are costly and often unavailable in community, rural, and international training contexts. ObjectiveThis report describes the architecture of AuditMed, a single-file, browser-based clinical evidence audit platform, and reports preliminary stress-test results against a complex multi-morbidity case corpus. AuditMed is intended for research and educational use and is not a substitute for clinical judgment or validated commercial clinical decision-support systems. MethodsAuditMed integrates nineteen free, publicly available clinical and biomedical application programming interfaces into a six-stage Search [->] Select [->] Parse [->] Analyze [->] Infer [->] Create pipeline and supports browser-local patient-case ingestion with regex-based HIPAA Safe Harbor de-identification. Preliminary stress-testing was conducted against eleven cases (Cases 30 through 40) from the Complex Clinical Case Compendium Software Validation Suite, each featuring over twenty concurrent active disease states. For each case, the one-click inference pipeline was executed with default settings and the full Clinical Inference Report was captured verbatim. No retrieval-sensitivity, synthesis-fidelity, or time-to-answer endpoints were pre-specified; the exercise was qualitative and oriented toward pipeline behavior under extreme multi-morbidity. ResultsThe pipeline completed without fatal errors for all eleven cases and produced a structured Clinical Inference Report in each instance. Quantitative-finding detection performed as designed for hematologic parameters and cardiac biomarkers. 
Two parser defects were identified and are reproduced in the appendix: an age-as-fever regex-precedence defect affecting seven cases and a diagnosis-versus-medication parsing defect affecting one case. Evidence-linkage rate varied from zero evidence-linked statements in seven cases to eleven in one case, reflecting dependence of the inference layer on MeSH-indexed literature coverage of the specific case diagnoses. Conclusions: AuditMed is an early-stage, open-source platform whose value at this stage is in providing a free, transparent, auditable workflow for multi-source evidence synthesis with explicit uncertainty flagging. The preliminary results document both robust end-to-end completion under extreme case complexity and specific, reproducible parser defects that will be addressed before formal evaluation. Planned evaluation studies are described.
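As a rough illustration of the regex-based de-identification step mentioned in the abstract above, the sketch below redacts a few identifier types with ordered patterns. It is not AuditMed's implementation: the pattern list, the labels, and the `redact` function are hypothetical, and the patterns cover only a small subset of the Safe Harbor identifier categories.

```python
import re

# Hypothetical, deliberately small pattern set. Patterns are applied in
# list order, so earlier entries take precedence over later ones.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed category label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen 03/14/2024, call 555-123-4567, SSN 123-45-6789, mail a.b@example.org"
print(redact(note))
```

Because the patterns run in sequence, their order decides which rule claims an ambiguous token. That kind of ordering sensitivity is one general way a precedence defect can arise in a regex-driven parser.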
Polo Sanchez, M.; Lesmes, A. C.; Muni, N.; Vigneault, F.; Novak, R.
Background: Rett Syndrome (RTT) is a severe neurodevelopmental disorder affecting approximately 1 in 10,000 live female births worldwide. The Rett Syndrome Behaviour Questionnaire (RSBQ) remains one of the most widely used standardized behavioral assessment tools for RTT. However, the RSBQ was originally validated only in British English, limiting its applicability for Spanish-speaking caregivers and clinical centers across Latin America and Spain. Objective: The primary aim of this study was to develop and validate the comprehension of the Spanish translation of the RSBQ to ensure cultural and linguistic equivalence, enhance data reliability, and facilitate earlier, more accurate clinical assessments among Spanish-speaking RTT populations. Methods: Surveys were administered in two phases to Spanish-speaking caregivers between November 2023 and September 2025. Phase I consisted of 12 guided survey administrations in which participants could ask clarifying questions and suggest linguistic modifications to RSBQ questions. Phase II consisted of independent online administration of the refined Spanish RSBQ and a retest at least 7 days later. Participants were recruited through direct outreach and supported virtually during questionnaire completion. Results: Following data cleaning and quality control, a total of 51 caregivers successfully completed both surveys. The Spanish RSBQ demonstrated high caregiver comprehension and strong engagement across multiple Latin American countries, including Argentina, Mexico, and Peru. Responses were highly correlated between test and retest timepoints, and no question showed biased response distributions.
A slight effect of response interval on test-retest correlation was observed, potentially indicating that natural disease progression confounds retest evaluation at long (>80-day) intervals; however, this effect did not alter the overall linguistic validation results, as an analysis restricted to responders with <21-day test-retest intervals confirmed the findings. Conclusions: This linguistic validation study represents the first formal step toward the clinical validation of the Spanish RSBQ, enabling broader inclusion of Spanish-speaking populations in RTT research. The collaborative, bilingual data collection strategy proved both feasible and effective, paving the way for multinational trials and expanding therapeutic accessibility through localized, patient-centered innovation.
Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.
Objective The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to lack of drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, of which most resembled drug-indication pairs.
Conclusion Our methodology converts highly fragmented and raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Researchers can readily use the harmonised prescription records for large-scale analyses.
Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.
Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
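The net-benefit comparison behind a decision curve, as described in the abstract above, can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses the standard definition NB = TP/N - (FP/N) * pt / (1 - pt), and the function names and toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of reviewing every case whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are penalised by the odds implied by the threshold.
    return tp / n - (fp / n) * (threshold / (1.0 - threshold))


def net_benefit_review_all(y_true, threshold):
    """Net benefit of the 'review every chart' reference strategy."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Toy data (hypothetical): 1 = transferred the next day, 0 = stayed.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.05, 0.5]

pt = 0.23  # the operating threshold quoted in the abstract
nb_model = net_benefit(y_true, y_prob, pt)
nb_all = net_benefit_review_all(y_true, pt)
nb_none = 0.0  # reviewing nobody yields zero net benefit by definition
print(f"model={nb_model:.3f} review-all={nb_all:.3f} review-none={nb_none:.3f}")
```

Repeating the calculation over a grid of thresholds and plotting each strategy gives the decision curve; a model is worth acting on only at thresholds where its curve lies above both reference strategies.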
Yamga, E.; Murphy, S.; Despres, P.
Background Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost.
Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
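The chunk-level aggregation rules and the entropy-based consistency measure named in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, and "two-vote" is assumed to mean that at least two chunks must be positive.

```python
import math
from collections import Counter


def aggregate(chunk_votes, rule):
    """Combine per-chunk binary predictions (True = phenotype present)."""
    positives = sum(1 for vote in chunk_votes if vote)
    if rule == "any-positive":
        return positives >= 1
    if rule == "two-vote":  # assumption: at least two positive chunks
        return positives >= 2
    if rule == "majority":
        return positives > len(chunk_votes) / 2
    raise ValueError(f"unknown rule: {rule}")


def shannon_entropy(labels):
    """Shannon entropy (bits) of repeated predictions for one document."""
    total = len(labels)
    counts = Counter(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


votes = [True, False, False, True, False]  # hypothetical chunk outputs
for rule in ("any-positive", "two-vote", "majority"):
    print(rule, aggregate(votes, rule))

# Entropy is 0 when repeated runs always agree and 1 bit for an even split.
print(shannon_entropy([True, True, True, True]))
print(shannon_entropy([True, True, False, False]))
```

The stricter the rule, the more positive chunks it needs before a document is labelled positive, so any-positive favours recall while majority favours precision.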
Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.
Objective To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.